Introduction

What is the bugRzilla Package?

bugRzilla is an R package that helps users interact with Bugzilla through its API.

To learn more, see bugRzilla.

About the bugRzilla Google Summer of Code Project

bugRzilla is a package for interacting with a Bugzilla API, especially R Bugzilla. The goal of the project is to help users submit issues to R Bugzilla.

About This Project

This project explores the issues and bugs on R Bugzilla to improve submissions made through bugRzilla. It might also help to identify useful patterns for R core or to report the status of R Bugzilla.

To learn more, see bugzilla_viz.

Setup Database on your local system

Install MySQL and MySQL Workbench

To install MySQL on Ubuntu, refer to the blog post by DigitalOcean. To install MySQL Workbench on Ubuntu, refer to the blog post by LinuxHint.

Download R_bugzilla data

  1. The R_bugzilla data can be downloaded from link.
  2. The download is a zip file, so extract it (for example with the Extract Here option) into the folder of your choice before importing it. The extracted file has the extension .sql (e.g. R-bugs.sql).

Import the downloaded R_bugzilla dump into MySQL

Before importing the R_bugzilla SQL file, create the (empty) database in MySQL if it doesn't already exist. This is required when the exported SQL doesn't contain CREATE DATABASE (i.e. it was exported with the --no-create-db or -n option).

Then open your terminal and run the command: mysql -u my_username -p database_name < input_file_path or, from inside the mysql shell, use the command: source <Path>/R-bugs.sql;

The options in use are:
  1. The -u flag indicates that the MySQL username will follow.
  2. The -p flag indicates we should be prompted for the password associated with the above username. database_name is the exact name of the database to import into, e.g. bugRzilla, the empty database you created.
  3. The < symbol is a Unix directive for STDIN, which feeds the contents of a file to the command as its input. In this case, that input is the SQL file specified by input_file_path.
For example:
  1. At the command prompt, run the following command to launch the mysql shell and enter it as the root user: mysql -u root -p
  2. When you’re prompted for a password, enter the one that you set at installation time, or if you haven’t set one, press Enter to submit no password. The following mysql shell prompt should appear: mysql>
  3. In the mysql shell, I used the following to load the data into the empty database:
    • Create an empty database: create database bugRzilla;
    • To check whether the database was created, use: show databases;
    • Once the empty database exists, load the SQL data into it with: use bugRzilla; followed by source /home/data/Documents/GSOC/R-bugs.sql;
    • To check that the data was loaded correctly, use: show tables;

      mysql> show tables;
      +---------------------+
      | Tables_in_bugRzilla |
      +---------------------+
      | attachments         |
      | bugs                |
      | bugs_activity       |
      | bugs_fulltext       |
      | bugs_mod            |
      | components          |
      | longdescs           |
      +---------------------+
      7 rows in set (0.00 sec)

bugRzilla Analysis

To connect to the database, I’m using the dplyr package, which supports widely used open-source databases such as MySQL.

The libraries used for the analysis:

# loading packages
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(dbplyr)
## 
## Attaching package: 'dbplyr'
## The following objects are masked from 'package:dplyr':
## 
##     ident, sql
library(RMySQL)
## Loading required package: DBI
library(DBI)
library(DT)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.2     ✓ stringr 1.4.0
## ✓ tidyr   1.1.3     ✓ forcats 0.5.1
## ✓ readr   1.4.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dbplyr::ident() masks dplyr::ident()
## x dplyr::lag()    masks stats::lag()
## x dbplyr::sql()   masks dplyr::sql()
library(skimr)
library(naniar)
## 
## Attaching package: 'naniar'
## The following object is masked from 'package:skimr':
## 
##     n_complete
library(ggplot2)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

Connect bugRzilla SQL Database with R

# Connecting R with MySQL
con <- dbConnect(
    MySQL(),
    dbname='bugRzilla', # change the database name to your database name
    username='root', # change the username to your username
    password='1204', # update your password
    host='localhost',
    port=3306)

#  Accessing Tables names from the Database
DBI::dbListTables(con)
## [1] "attachments"   "bugs"          "bugs_activity" "bugs_fulltext"
## [5] "bugs_mod"      "components"    "longdescs"
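
The connection above writes the username and password directly into the script. As a minimal sketch of an alternative, the settings could be read from environment variables instead. The helper name `db_settings()` and the variable names (`BUGRZILLA_DB`, `BUGRZILLA_USER`, and so on) are assumptions made for illustration; they are not part of the original setup.

```r
# Hypothetical helper: collect connection settings from environment variables.
# The variable names below are illustrative assumptions, not an existing convention.
db_settings <- function() {
    list(
        dbname   = Sys.getenv("BUGRZILLA_DB", unset = "bugRzilla"),
        username = Sys.getenv("BUGRZILLA_USER", unset = "root"),
        password = Sys.getenv("BUGRZILLA_PASSWORD", unset = ""),
        host     = Sys.getenv("BUGRZILLA_HOST", unset = "localhost"),
        port     = as.integer(Sys.getenv("BUGRZILLA_PORT", unset = "3306"))
    )
}

settings <- db_settings()

# The settings could then be handed to dbConnect(), for example:
# con <- do.call(DBI::dbConnect, c(list(RMySQL::MySQL()), settings))
```

Keeping the password out of the script means the file can be shared without exposing credentials.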

Data Exploration of the bugs Table from the Database

bugs_df <- tbl(con, "bugs")
## Warning in .local(conn, statement, ...): Decimal MySQL column 24 imported as
## numeric
## Warning in .local(conn, statement, ...): Decimal MySQL column 25 imported as
## numeric
#for quick view of the datatypes and the structure of data
skim(bugs_df)
## Warning in .local(conn, statement, ...): Decimal MySQL column 24 imported as
## numeric

## Warning in .local(conn, statement, ...): Decimal MySQL column 25 imported as
## numeric
Data summary
Name bugs_df
Number of rows 7042
Number of columns 27
_______________________
Column type frequency:
character 15
numeric 12
________________________
Group variables None

Variable type: character

skim_variable      n_missing  complete_rate  min  max  empty  n_unique  whitespace
bug_file_loc               0              1    0  136   6990        51           0
bug_severity               0              1    5   11      0         7           0
bug_status                 0              1    3   11      0         8           0
creation_ts                0              1   19   19      0      7028           0
delta_ts                   0              1   19   19      0      6308           0
short_desc                 0              1    1  255      0      6923           0
op_sys                     0              1    3   15      0        22           0
priority                   0              1    2    2      0         5           0
rep_platform               0              1    3   25      0         7           0
version                    0              1    3   15      0        43           0
resolution                 0              1    0   19    564        12           0
target_milestone           0              1    3    3      0         1           0
status_whiteboard          0              1    0    0   7042         1           0
lastdiffed                 0              1   19   19      0      6324           0
deadline                7008              0   19   19      0        30           0

Variable type: numeric

skim_variable        n_missing  complete_rate      mean       sd  p0      p25      p50       p75   p100  hist
bug_id                       0              1  10817.89  6189.36   1  5686.75  14101.5  16048.75  18097  ▃▁▂▂▇
assigned_to                  0              1     17.48   120.26   1     2.00      5.0     16.00   2787  ▇▁▁▁▁
product_id                   0              1      2.00     0.00   2     2.00      2.0      2.00      2  ▁▁▇▁▁
reporter                     0              1    685.69  1003.34   1     2.00      2.0   1056.00   3432  ▇▂▁▁▁
component_id                 0              1      9.84     5.20   2     6.00      9.0     15.00     19  ▇▇▆▃▆
qa_contact                7042              0       NaN       NA  NA       NA       NA        NA     NA
votes                        0              1      0.00     0.00   0     0.00      0.0      0.00      0  ▁▁▇▁▁
everconfirmed                0              1      0.83     0.38   0     1.00      1.0      1.00      1  ▂▁▁▁▇
reporter_accessible          0              1      1.00     0.00   1     1.00      1.0      1.00      1  ▁▁▇▁▁
cclist_accessible            0              1      1.00     0.00   1     1.00      1.0      1.00      1  ▁▁▇▁▁
estimated_time               0              1      0.10     6.60   0     0.00      0.0      0.00    552  ▇▁▁▁▁
remaining_time               0              1      0.00     0.00   0     0.00      0.0      0.00      0  ▁▁▇▁▁
From the above table we can conclude that a few of the columns have the wrong datatype:
  1. creation_ts
  2. delta_ts
  3. lastdiffed
  4. estimated_time
  5. remaining_time
  6. deadline
Note: The columns estimated_time and remaining_time contain only integer values, so they can’t be converted to a date datatype. There are also columns that are empty and therefore of no use for the analysis:
  1. target_milestone
  2. qa_contact
  3. status_whiteboard
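
The timestamp columns listed above arrive from MySQL as character strings. As a small sketch on toy values (not the real table), the snippet below shows two ways such strings could be converted: `as.Date()` keeps only the calendar date, while `as.POSIXct()` also keeps the time of day. The UTC time zone is an assumption made for the example.

```r
# Toy timestamps in the same "YYYY-MM-DD HH:MM:SS" layout as creation_ts.
ts_chr <- c("2021-05-07 10:15:00", "1998-08-07 23:59:59")

# as.Date() keeps only the calendar date and drops the time of day.
ts_date <- as.Date(ts_chr)

# as.POSIXct() keeps the time of day as well; UTC is an assumption here.
ts_time <- as.POSIXct(ts_chr, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

class(ts_chr)   # "character"
class(ts_date)  # "Date"
format(ts_time, "%H:%M")
```

Which conversion fits depends on whether the analysis needs daily or finer resolution.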

# Converting `bugs_df` to `dataframe`
bugs_df <- as.data.frame(bugs_df)
## Warning in .local(conn, statement, ...): Decimal MySQL column 24 imported as
## numeric
## Warning in .local(conn, statement, ...): Decimal MySQL column 25 imported as
## numeric

Cleaning the data

As a first step, check the data and prepare it for the analysis:

#converting the required fields in the correct datatype format
bugs_df <- bugs_df %>%
    mutate_at(vars("creation_ts", "delta_ts", "lastdiffed", "deadline"), as.Date)
# Taking the columns which are useful
bugs_df <- bugs_df %>%
    select("bug_id", "bug_severity", "bug_status", "creation_ts", "delta_ts", "op_sys", "priority", "resolution", "component_id", "version", "lastdiffed", "deadline")
#for quick view of the datatypes and the structure of data
skim(bugs_df)
Data summary
Name bugs_df
Number of rows 7042
Number of columns 12
_______________________
Column type frequency:
character 6
Date 4
numeric 2
________________________
Group variables None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
bug_severity           0              1    5   11      0         7           0
bug_status             0              1    3   11      0         8           0
op_sys                 0              1    3   15      0        22           0
priority               0              1    2    2      0         5           0
resolution             0              1    0   19    564        12           0
version                0              1    3   15      0        43           0

Variable type: Date

skim_variable  n_missing  complete_rate         min         max      median  n_unique
creation_ts           14              1  1998-08-07  2021-05-07  2009-12-08      4274
delta_ts              30              1  1998-08-09  2021-05-08  2012-07-20      3562
lastdiffed            14              1  1998-08-07  2021-05-08  2012-07-10      3565
deadline            7008              0  2010-04-23  2015-04-23  2013-11-09        30

Variable type: numeric

skim_variable  n_missing  complete_rate      mean       sd  p0      p25      p50       p75   p100  hist
bug_id                 0              1  10817.89  6189.36   1  5686.75  14101.5  16048.75  18097  ▃▁▂▂▇
component_id           0              1      9.84     5.20   2     6.00      9.0     15.00     19  ▇▇▆▃▆
#showing the `datatable`
datatable(head(bugs_df, 5), options = list(scrollX = TRUE))

About the Bugs Data used for Analysis

I’ve taken 12 columns into consideration for the analysis. A brief description of the columns follows:
  1. bug_id: Unique numeric identifier for bug.
  2. bug_severity: How severe the bug is, e.g. enhancement, critical, etc.
  3. bug_status: Current status, e.g. NEW, RESOLVED, etc.
  4. creation_ts: When bug was filed.
  5. delta_ts: The timestamp of the last update on the bug. This includes updates to some related tables (e.g. “longdescs”).
  6. op_sys: Operating system bug was seen on, e.g. Windows Vista, Linux, etc.
  7. priority: The priority of the bug (P1 = most urgent, P5 = least urgent).
  8. resolution: The resolution, if the bug is in a closed state, e.g. FIXED, DUPLICATE, etc.
  9. component_id: Numeric ids of the components.
  10. version: Version of software in which bug is seen.
  11. lastdiffed: The time at which information about this bug changing was last emailed to the cc list.
  12. deadline: Date by which bug must be fixed.

Visualizations

# Plotting a bar graph of bug_id against creation_ts and adding a time-series trace to see the spread
data <- data.frame(bugs_df$bug_id, bugs_df$creation_ts)
fig1 <- plot_ly(data,
                x = ~bugs_df$creation_ts,
                y = ~bugs_df$bug_id,
                type = 'bar',
                name = "bug_creation bar")
# add_trace() reuses the x and y mappings from plot_ly(), so only the trace type changes
fig1 <- fig1 %>%
    add_trace(type = 'scatter',
              mode = 'lines+markers',
              name = "bug_creation Time_series")
fig1
## Warning: Ignoring 14 observations

In the visualization above, the time-series trace shows the month and year in which each bug_id was filed, and the bar graph shows the years in which the most bugs were filed. Zooming in on the graphs shows the date on which each bug was filed. Most of the bugs were filed in January and July.
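
The monthly pattern can also be checked numerically rather than read off the plot. The following is a minimal base R sketch on toy dates; the real analysis would use `bugs_df$creation_ts`, and the toy values are made up for illustration.

```r
# Toy creation dates; the real analysis would use bugs_df$creation_ts.
creation_ts <- as.Date(c("2020-01-15", "2020-01-20", "2020-07-03",
                         "2021-01-09", "2021-03-11"))

# Extract the month ("01" to "12") and the year, then count bugs in each.
bugs_per_month <- table(format(creation_ts, "%m"))
bugs_per_year  <- table(format(creation_ts, "%Y"))

# Months ordered from most to fewest bugs filed.
sort(bugs_per_month, decreasing = TRUE)
```

Counting by month and by year separately keeps seasonal effects apart from long-term trends.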

# Plotting the Time Series graph with the bug_id and delta_ts
data <- data.frame(bugs_df$bug_id, bugs_df$delta_ts)
fig2 <- plot_ly(data, 
                x = ~bugs_df$delta_ts, 
                y = ~bugs_df$bug_id, 
                type = 'scatter', 
                mode = 'markers')
fig2
## Warning: Ignoring 30 observations

In the visualization above, the time-series graph shows when each bug_id was last updated. Most of the bugs were last updated in January, March, April, and July.

# Plotting bar graph with bug_id and resolution
data <- data.frame(bugs_df$bug_id, bugs_df$resolution)
fig4 <- plot_ly(data,
                x = ~bugs_df$resolution,
                y = ~bugs_df$bug_id,
                type = 'bar')
fig4

In the visualization above, the resolution bar graph shows which resolution category each bug_id belongs to if the bug is in a closed state, e.g. FIXED, DUPLICATE, etc. We can conclude that most bugs belong to the FIXED resolution category.

# Plotting bar graph with bug_id and bug_status
data <- data.frame(bugs_df$bug_id, bugs_df$bug_status)
fig5 <- plot_ly(data,
                x = ~bugs_df$bug_status,
                y = ~bugs_df$bug_id,
                type = 'bar')
fig5

In the visualization above, the bug_status bar graph shows which bug_status category each bug_id belongs to, e.g. NEW, RESOLVED, etc. We can conclude that most bugs belong to the CLOSED category of bug_status.

# Plotting bar graph with bug_id and bug_severity
data <- data.frame(bugs_df$bug_id, bugs_df$bug_severity)
fig6 <- plot_ly(data,
                x = ~bugs_df$bug_severity,
                y = ~bugs_df$bug_id,
                type = 'bar')
fig6

In the visualization above, the bug_severity bar graph shows which bug_severity category each bug_id belongs to. Most of the bugs filed are normal; some are filed as enhancements (requests for features), minor, or major; and very few bugs are filed under the blocker category.
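
The bar graphs above put bug_id on the y-axis, so the bar heights depend on the ID values as well as on the number of bugs. As a hedged sketch, the share of bugs in each category could be tabulated directly. `category_share()` is a hypothetical helper written for this example, and the toy vector stands in for a column such as `bugs_df$resolution`.

```r
# Toy resolutions; the real analysis would use a column such as bugs_df$resolution.
resolution <- c("FIXED", "FIXED", "DUPLICATE", "INVALID", "FIXED", "")

# Hypothetical helper: count and percentage share per category, largest first.
category_share <- function(x) {
    counts <- sort(table(x), decreasing = TRUE)
    data.frame(
        category = names(counts),
        n        = as.integer(counts),
        percent  = round(100 * as.integer(counts) / sum(counts), 1),
        stringsAsFactors = FALSE
    )
}

res_share <- category_share(resolution)
res_share
```

The same helper could be applied to bug_status or bug_severity to compare categories on a common scale.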

Data Exploration of the bugs and attachments Tables from the Database

bugs_attach_df <- tbl(con, "attachments")
# Converting `bugs_attach_df` to `dataframe`
bugs_attach_df <- as.data.frame(bugs_attach_df)
#for quick view of the datatypes and the structure of data
skim(bugs_attach_df)
Data summary
Name bugs_attach_df
Number of rows 1823
Number of columns 11
_______________________
Column type frequency:
character 5
numeric 6
________________________
Group variables None

Variable type: character

skim_variable      n_missing  complete_rate  min  max  empty  n_unique  whitespace
creation_ts                0              1   19   19      0      1771           0
modification_time          0              1   19   19      0      1630           0
description                0              1    0  174    187      1485           0
mimetype                   0              1    8   71      0        69           0
filename                   0              1    3   70      0      1522           0

Variable type: numeric

skim_variable  n_missing  complete_rate      mean       sd  p0      p25    p50      p75   p100  hist
attach_id              0              1   1876.50   572.77   1   1362.5   1895   2380.5   2838  ▁▃▇▇▇
bug_id                 0              1  15351.15  3661.05   1  15004.0  16413  17369.0  18097  ▁▁▁▁▇
ispatch                0              1      0.43     0.49   0      0.0      0      1.0      1  ▇▁▁▁▆
submitter_id           0              1   1313.33  1104.28   1    317.0    979   2143.0   3432  ▇▆▂▃▃
isobsolete             0              1      0.12     0.32   0      0.0      0      0.0      1  ▇▁▁▁▁
isprivate              0              1      0.00     0.00   0      0.0      0      0.0      0  ▁▁▇▁▁

Cleaning attachments Data

# converting the timestamp columns to Date and the 0/1 flag columns to logical
bugs_attach_df <- bugs_attach_df %>%
    mutate_at(vars("creation_ts", "modification_time"), as.Date) %>%
    mutate_at(vars("isobsolete", "isprivate", "ispatch"), as.logical)
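
For readers on dplyr 1.0.0 or later, the same kind of column-wise conversion could be written with `across()`. This is only a sketch on a toy data frame, not a change to the analysis above; `toy_attach` and its values are made up for illustration.

```r
library(dplyr)

# Toy stand-in for the attachments table.
toy_attach <- data.frame(
    attach_id  = c(1, 2),
    ispatch    = c(0, 1),
    isobsolete = c(1, 0)
)

# across() applies one function to several columns inside mutate().
toy_attach <- toy_attach %>%
    mutate(across(c(ispatch, isobsolete), as.logical))

str(toy_attach)
```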

Joining the bugs and attachments tables

#joining the `attachments` and `bugs` tables on their shared columns (`bug_id` and `creation_ts`), keeping unmatched rows from both
baa <- merge(bugs_attach_df, bugs_df, by = intersect(names(bugs_attach_df), names(bugs_df)), all = TRUE)

# Create four columns, `creation_month`, `creation_year`, `lastdiffed_month`, and `lastdiffed_year`, to find the month and year in which a bug was created and last modified, respectively.
baa <- baa %>%
    mutate(creation_month = format(creation_ts, "%m"), creation_year = format(creation_ts, "%Y"), lastdiffed_month = format(lastdiffed, "%m"), lastdiffed_year = format(lastdiffed, "%Y")) %>%
    group_by(creation_month, creation_year)
datatable(head(baa, 5), options = list(scrollX = TRUE))
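
Because both tables carry a creation_ts column, a join on every shared column only matches rows where the attachment and the bug have the same date. As an illustrative alternative, the sketch below joins two toy tables on bug_id alone and uses suffixes to keep the two timestamps apart. The toy data and the suffix names are assumptions made for this example.

```r
# Toy stand-ins for the bugs and attachments tables.
bugs <- data.frame(
    bug_id      = c(1, 2, 3),
    creation_ts = as.Date(c("2020-01-01", "2020-02-01", "2020-03-01")),
    resolution  = c("FIXED", "DUPLICATE", "FIXED")
)
attachments <- data.frame(
    attach_id   = c(10, 11),
    bug_id      = c(1, 1),
    creation_ts = as.Date(c("2020-01-05", "2020-01-09"))
)

# Join on bug_id only; the suffixes keep the two creation_ts columns apart.
# all.y = TRUE keeps bugs that have no attachment.
joined <- merge(attachments, bugs, by = "bug_id",
                suffixes = c("_attachment", "_bug"), all.y = TRUE)
joined
```

With this shape, each attachment row carries the filing date of its bug, and bugs without attachments appear once with missing attachment fields.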

About the bugs and attachments Data Used for Analysis

I’ve taken 15 columns into consideration for the analysis. A brief description of the columns follows:
  1. bug_id: Unique numeric identifier for bug.
  2. attach_id: Unique numeric identifier for attachment.
  3. creation_ts: When bug was filed.
  4. modification_time: The date and time on which the attachment was last modified.
  5. description: Text describing the attachment.
  6. mimetype: Content type of the attachment like text/plain or image/png.
  7. ispatch: Whether attachment is a patch.
  8. filename: Path-less file name of the attachment.
  9. submitter_id: Unique numeric identifier of the user who submitted the attachment.
  10. isobsolete: Whether attachment is marked obsolete.
  11. isprivate: TRUE if the attachment should be private and FALSE if the attachment should be public.
  12. creation_month: The month in which the bug is created.
  13. creation_year: The year in which the bug is created.
  14. lastdiffed_month: The month in which the bug is last modified.
  15. lastdiffed_year: The year in which the bug is last modified.

Visualizations

#filtering the data where resolution is Duplicate
res_dupli <- baa %>%
    filter(resolution == "DUPLICATE")

# plotting graph with creation month where resolution is Duplicate
fig7 <- ggplot(res_dupli) +
    geom_bar(aes(x = creation_month))
ggplotly(fig7)
# plotting graph with creation year where resolution is Duplicate
fig8 <- ggplot(res_dupli) +
    geom_bar(aes(x = creation_year))
ggplotly(fig8)
# data <- data.frame(res_dupli$creation_year, res_dupli$creation_month)
# fig7 <- res_dupli %>%
#     plot_ly(
#         data = data,
#         x = ~res_dupli$creation_year,
#         y = ~res_dupli$creation_month,
#         type = 'scatter',
#         mode = 'markers'
#     )
# fig7 <- plot_ly(data,
#                 x = ~res_dupli$creation_year,
#                 y = ~res_dupli$creation_month,
#                 type = 'bar',
#                 name = "Duplicate_Bugs Bar")
# fig7 <- fig7 %>%
#     layout(
#         title = "Duplicate Bugs",
#         yaxis = list(title = "creation_month"),
#         xaxis = list(title = "creation_year"),
#         autosize = FALSE, margin = res_dupli$creation_year
#     )
# fig7

The visualization above shows when the duplicate bugs were filed. From the graph we can see that most were filed from 2011 to 2014. Duplicate bugs are more frequent in January and July than in the other months of the year.

#filtering the data where resolution is Fixed
res_fixed <- baa %>%
    filter(resolution == "FIXED")

# plotting graph with creation month where resolution is Fixed
fig9 <- ggplot(res_fixed) +
    geom_bar(aes(x = creation_month))
ggplotly(fig9)
# plotting graph with creation year where resolution is Fixed
fig10 <- ggplot(res_fixed) +
    geom_bar(aes(x = creation_year))
ggplotly(fig10)
# 
# data <- data.frame(res_fixed$creation_month, res_fixed$creation_year)
# fig9 <- res_fixed %>% 
#     plot_ly(
#         data = data,
#         y = ~res_fixed$creation_month, 
#         x = ~res_fixed$creation_year,
#         type = 'scatter',
#         mode = 'markers'
#     )
# fig9 <- plot_ly(data,
#                 x = ~res_fixed$creation_year,
#                 y = ~res_fixed$creation_month,
#                 type = 'bar',
#                 name = "Duplicate_Bugs Bar")
# fig9 <- fig9 %>%
#     layout(
#         title = "Fixed Bugs",
#         yaxis = list(title = "creation_month"),
#         xaxis = list(title = "creation_year"),
#         autosize = FALSE, margin = res_fixed$bug_id
#     )
# fig9

The histogram above is about how many weeks it takes for a bug to get fixed. From the graph we can see that most of the bugs are fixed in under 150 weeks.
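
A time-to-fix figure in weeks can be derived from two date columns. The sketch below does this on toy data, treating delta_ts (the last update) as a stand-in for the fix date; that is an assumption, since the last update is not necessarily the moment of the fix.

```r
# Toy data; delta_ts is used as a stand-in for the fix date (an assumption).
fixed <- data.frame(
    bug_id      = c(1, 2, 3),
    creation_ts = as.Date(c("2020-01-01", "2020-01-01", "2020-06-01")),
    delta_ts    = as.Date(c("2020-01-15", "2020-03-11", "2020-06-08"))
)

# Elapsed time between filing and last update, in weeks.
fixed$weeks_to_fix <- as.numeric(
    difftime(fixed$delta_ts, fixed$creation_ts, units = "weeks")
)

# Share of bugs closed in under 150 weeks.
mean(fixed$weeks_to_fix < 150)

# A histogram of the durations could then be drawn with, for example:
# hist(fixed$weeks_to_fix)
```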

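#filtering the data where resolution is Invalid
res_invalid <- baa %>%
    filter(resolution == "INVALID")
# plotting graph with creation month where resolution is Invalid
fig11 <- ggplot(res_invalid) +
    geom_bar(aes(x = creation_month))
ggplotly(fig11)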
# data <- data.frame(res_invalid$bug_id, res_invalid$creation_ts)
# fig9 <- plot_ly(
#     data,
#     x = ~res_invalid$creation_ts,
#     y = ~res_invalid$bug_id,
#     type = 'bar',
#     name = "bug_creation bar"
# )
# fig9 <- fig9 %>%
#     add_trace(
#         fig7,
#         type = 'scatter',
#         mode='lines+markers',
#         name = "bug_creation Time_series"
#     )
# fig9 <- fig9 %>%
#     layout(
#         title = "Invalid Bugs",
#         yaxis = list(title = "Bug_id"),
#         xaxis = list(title = "Creation_time"),
#         autosize = FALSE, margin = res_invalid$bug_id
#     )
# fig9

This visualization refers to the creation of the invalid bugs. The invalid bugs were filed from 2010 to 2016.

Data Exploration of the bugs_mod Table from the Database

bugs_mod_df <- tbl(con, "bugs_mod")
# Converting `bugs_mod_df` to `dataframe`
bugs_mod_df <- as.data.frame(bugs_mod_df)
#for quick view of the datatypes and the structure of data
skim(bugs_mod_df)
Data summary
Name bugs_mod_df
Number of rows 7042
Number of columns 28
_______________________
Column type frequency:
character 18
numeric 10
________________________
Group variables None

Variable type: character

skim_variable      n_missing  complete_rate  min  max  empty  n_unique  whitespace
row_names                  0           1.00    1    4      0      7042           0
bug_file_loc            7042           0.00   NA   NA      0         0           0
bug_severity               0           1.00    5   11      0         7           0
bug_status                 0           1.00    3   11      0         8           0
creation_ts               14           1.00   10   10      0      4274           0
delta_ts                7042           0.00   NA   NA      0         0           0
short_desc                 0           1.00    1  255      0      6923           0
op_sys                  7042           0.00   NA   NA      0         0           0
priority                   0           1.00    2    2      0         5           0
rep_platform               0           1.00    3   25      0         7           0
version                    0           1.00    3   15      0        43           0
resolution               564           0.92    4   19      0        11           0
target_milestone           0           1.00    3    3      0         1           0
status_whiteboard       7042           0.00   NA   NA      0         0           0
lastdiffed              7042           0.00   NA   NA      0         0           0
estimated_time             0           1.00    4    6      0        19           0
remaining_time             0           1.00    4    4      0         1           0
deadline                7008           0.00   10   10      0        30           0

Variable type: numeric

skim_variable        n_missing  complete_rate      mean       sd  p0      p25      p50       p75   p100  hist
bug_id                       0              1  10817.89  6189.36   1  5686.75  14101.5  16048.75  18097  ▃▁▂▂▇
assigned_to                  0              1     17.48   120.26   1     2.00      5.0     16.00   2787  ▇▁▁▁▁
product_id                   0              1      2.00     0.00   2     2.00      2.0      2.00      2  ▁▁▇▁▁
reporter                     0              1    685.69  1003.34   1     2.00      2.0   1056.00   3432  ▇▂▁▁▁
component_id                 0              1      9.84     5.20   2     6.00      9.0     15.00     19  ▇▇▆▃▆
qa_contact                7042              0       NaN       NA  NA       NA       NA        NA     NA
votes                        0              1      0.00     0.00   0     0.00      0.0      0.00      0  ▁▁▇▁▁
everconfirmed                0              1      0.83     0.38   0     1.00      1.0      1.00      1  ▂▁▁▁▇
reporter_accessible          0              1      1.00     0.00   1     1.00      1.0      1.00      1  ▁▁▇▁▁
cclist_accessible            0              1      1.00     0.00   1     1.00      1.0      1.00      1  ▁▁▇▁▁
#showing the `bugs_mod_df` table in the `datatable`
datatable(head(bugs_mod_df, 5), options = list(scrollX = TRUE))

Data Exploration of the longdescs Table from the Database

longdescs_df <- tbl(con, "longdescs")
## Warning in .local(conn, statement, ...): Decimal MySQL column 4 imported as
## numeric
# Converting `longdescs_df` to `dataframe`
longdescs_df <- as.data.frame(longdescs_df)
## Warning in .local(conn, statement, ...): Decimal MySQL column 4 imported as
## numeric
#for quick view of the datatypes and the structure of data
skim(longdescs_df)
Data summary
Name longdescs_df
Number of rows 26942
Number of columns 11
_______________________
Column type frequency:
character 3
numeric 8
________________________
Group variables None

Variable type: character

skim_variable  n_missing  complete_rate  min     max  empty  n_unique  whitespace
bug_when               0           1.00   19      19      0     26270           0
thetext                0           1.00    0  422285    772     25588           0
extra_data         24966           0.07    1       5      0      1948           0

Variable type: numeric

skim_variable    n_missing  complete_rate      mean       sd  p0       p25      p50       p75   p100  hist
comment_id               0              1  83378.70  7986.99   1  76528.25  83263.5  90215.75  97284  ▁▁▁▃▇
bug_id                   0              1  10479.44  6260.77   1   4195.00  13361.0  16072.00  18097  ▅▁▃▂▇
who                      0              1    457.47   896.85   1      2.00      2.0    412.00   3432  ▇▁▁▁▁
work_time                0              1      0.00     0.04   0      0.00      0.0      0.00      5  ▇▁▁▁▁
isprivate                0              1      0.00     0.00   0      0.00      0.0      0.00      0  ▁▁▇▁▁
already_wrapped          0              1      0.00     0.00   0      0.00      0.0      0.00      0  ▁▁▇▁▁
type                     0              1      0.35     1.26   0      0.00      0.0      0.00      6  ▇▁▁▁▁
is_markdown              0              1      0.04     0.20   0      0.00      0.0      0.00      1  ▇▁▁▁▁
#showing the `longdescs_df` table in the `datatable`
datatable(head(longdescs_df, 5), options = list(scrollX = TRUE))
# closing the database connection
dbDisconnect(con)
## [1] TRUE